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Abstract 

Background: In computational biology, a novel knowledge has been obtained mostly by identifying 'intra-relation,' 
the relation between entities on a specific biological level such as from gene expression or from microRNA 
(miRNA) and many such researches have been successful. However, intra-relations are not fully explaining complex 
cancer mechanisms because the inter-relation information between different levels of genomic data is missing, e.g. 
miRNA and its target genes. The 'inter-relation' between different levels of genomic data can be constructed from 
biological experimental data as well as genomic knowledge. 

Methods: Previously, we have proposed a graph-based framework that integrates with multi-layers of genomic 
data, copy number alteration, DNA methylation, gene expression, and miRNA expression, for the cancer clinical 
outcome prediction. However, the limitation of previous work was that we integrated with multi-layers of genomic 
data without considering of inter-relationship information between genomic features. In this paper, we propose a 
new integrative framework that combines genomic dataset from gene expression and genomic knowledge from 
inter-relation between miRNA and gene expression for the clinical outcome prediction as a pilot study. 

Results: In order to demonstrate the validity of the proposed method, the prediction of short-term/long-term 
survival for 82 patients in glioblastoma multiforme (GBM) was adopted as a base task. Based on our results, the 
accuracy of our predictive model increases because of incorporation of information fused over genomic dataset 
from gene expression and genomic knowledge from inter-relation between miRNA and gene expression. 

Conclusions: In the present study, the intra-relation of gene expression was reconstructed from inter-relation 
between miRNA and gene expression for prediction of short-term/long-term survival of GBM patients. Our finding 
suggests that the utilization of external knowledge representing miRNA-mediated regulation of gene expression is 
substantially useful for elucidating the cancer phenotype. 



Introduction 

DNA microarrays have already been widely used for the 
classification of tumor subtypes or clinical outcomes for 
the diagnosis, treatment, or prognosis of cancer for 
many years [1-6]. In addition to gene expression, there 
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have been attempts at cancer clinical outcome predic- 
tion using different levels of genomic data such as copy 
number, DNA methylation, or miRNA [7-11]. Despite 
these efforts, however, the elucidation of cancer pheno- 
types remains problematic since the cancer genome is 
neither simple nor independent but is complicated and 
dysregulated by multiple mechanisms in the biological 
system [12,13]. Previously, we have proposed a graph- 
based framework that integrates with multi-layers of 
genomic data, copy number alteration, DNA methyla- 
tion, gene expression, and miRNA expression, for 
the prediction of clinical outcomes in glioblastoma 



© 2013 Kim et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons 
BiolVlGCl C^ntrBl Attribution License ihttp://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. 



Kim ef al. BMC Systems Biology 2013, 7(Suppl 3):S8 
http://www.biomedcentral.eom/1752-0509/7/S3/S8 



Page 2 of 1 1 



multiforme (GBM) and serous cystadenocarcinoma [14]. 
The strengths of our approach were also highlighted as 
initiating its application using multiple scales and com- 
putation efficiency [15]. 

In computational biology, a novel knowledge has been 
obtained mostly by identifying 'intra-relation,' the rela- 
tion between entities on a specific biological level such 
as from gene expression or from microRNA (miRNA) 
and many such researches have been successful 
[14,16,17]. However, intra-relations are not fully explain- 
ing complex cancer mechanisms because the inter-rela- 
tion information between different levels of genomic 
data is missing, e.g. miRNA and its target genes. The 
'inter-relation' between different levels of genomic data 
can be constructed from biological experimental data as 
well as genomic knowledge. 

There are possible inter-relationships between the 
genomic features belonging to different levels of geno- 
mic data such as 'miRNA-target genes,' 'copy number 
alteration region-genes located in the altered region,' 
'DNA methylation site-specific genes regulated by pro- 
moter regions,' etc. However, the limitation of previous 
work was that we integrated with multi-layers of geno- 
mic data for cancer clinical outcome prediction without 
considering of inter-relationship information between 
genomic features [14]. We assume that accuracy of pre- 
diction model increase when considering of inter-rela- 
tionship between different levels of genomic data 
because of incorporation of information fused over 
genomic dataset and genomic knowledge, providing an 
enhanced global view on interplays in cancer mechan- 
isms [12,18]. Therefore, when integrating multi-layers of 
genomic data, it will be desirable that a framework will 
be capable of containing the inter-relationships between 
genomic features belonging to different layers of the 
biological system. 

In this paper, we propose a new integrative framework 
that combines genomic dataset from gene expression 
and genomic knowledge from inter-relation between 
miRNA and gene expression for the clinical outcome 
prediction as a pilot study. miRNAs are involved in the 
post-transcriptional regulation of genes either by indu- 
cing degradation of the transcript of their multiple tar- 
gets or by repressing the translation of mRNA into 
protein [19,20]. In addition, miRNAs regulate many 
genes associated with different biological processes such 
as development, stress response, apoptosis, proliferation, 
and tumorigenesis [21-25]. In order to demonstrate the 
validity of the proposed method, the prediction of short- 
term/long-term survival for 82 patients in GBM was 
adopted as a base task. GBM is the most common and 
aggressive primary brain tumor in adults [26], and 
notorious for its tendency to recur [27]. Despite recent 
advances in the molecular pathology of GBM, the 



underling molecular mechanisms associated with clinical 
outcome are still poorly understood [28]. 

The remainder of the paper is organized as follows. 
Data description and methods for prediction based on 
intra-relation among mRNAs and prediction based on 
inter-relation from miRNA to mRNA are explained in 
the Materials and Methods section. In the Results sec- 
tion, experimental results and biological implications are 
provided to demonstrate the validity and effectiveness of 
our proposed approach. Finally, we discuss the meaning 
of our study and future works in the last section. 

Materials and methods 

Data 

Normalized datasets were retrieved from the Cancer 
Genome Atlas (TCGA) data portal (http://tcga-data.nci. 
nih.gov/). A binary classification problem was set using 
the survival information from patient. In the classifica- 
tion of short-term or long-term survival, 'long-term' 
represents samples derived from patients who survived 
longer than 24 months [29]. The total 82 patients' 
records were available across the miRNA and gene 
expression data sets (N = 82), in which 54 were short- 
term survival while the remaining were long-term 
survival. 

Retrieving mRNA targets of miRNA 

There is a many-to-many relationship between miRNAs 
and mRNAs since a single miRNA targets multiple 
mRNAs or a single mRNA is targeted by multiple miR- 
NAs. In order to get target relations between miRNA 
and mRNA, we used miRecords which is integrated 
resources of miRNA that store target interactions pro- 
duced by 11 established miRNA target prediction pro- 
grams [30]. We created 10 variations for predicted 
target pairs between miRNA and genes by considering 
the number of positive voters from the included algo- 
rithms by miRecords (Additional file 1). Since most of 
the evaluation results from these variations were largely 
comparable, the most representative variation # 6 in 
Additional file 1 was used to describe the overall study 
results in the following sections. 

Prediction based on intra-relation among mRNAs 

We used a graph-based semi-supervised learning (SSL) as 
a classification algorithm, which is a halfway learning 
scheme between supervised and unsupervised learning 
[31-34]. The graph-based SSL takes advantage of compu- 
tational efficiency and representational ease for the biolo- 
gical system. The learning time of graph-based SSL is 
nearly linear with the number of graph edges while the 
accuracy remains comparable to the kernel-based meth- 
ods that suffer from the relative disadvantage of a longer 
learning time [16,35]. In addition, the interpretation of 
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biological phenomena can be improved because of the 
graph structure [36-38], which naturally fits into the 
graph based SSL. 

In this study, the entity of intra-relation or inter-rela- 
tion is a patient. We define the intra-relation as a graph 
constructed based on single genomic data alone such as 
gene expression data. On the other hands, we define the 
inter-relation as a graph constructed based on relation- 
ship between different levels of genomic data such as 
gene expression and miRNA data. If two patients' sam- 
ples were more closely related than to others, we 
assumed that the clinical outcomes of those two patients 
were more likely to be similar. Thus, clinical outcome 
prediction can be done by considering similarities 
between patient samples. A natural method of analyzing 
relationships between entities is a graph, where nodes 
represent patients and edges show their possible 



(A) 




Gene expression (G 0 ) 

(B) 




relations. Figure 1 (A) represents an example graph, 
which was conducted using the gene expression. An 
annotated patient is labeled either by '-Y or 'V, indicat- 
ing the two possible clinical outcomes, either 'short- 
term survival' or 'long-term survival.' In order to predict 
the label of the unannotated patient '?', the edges con- 
nected from/to the patient play an important role in 
influencing propagation between the patient and its 
neighbors. This idea can be easily formulated using 
graph-based semi-supervised learning [34]. Edges repre- 
sent relations, more specifically similarities between 
patients that may be extracted from different genomic 
data of gene expression or miRNA. Different types of 
data produce different graphs. Consequently, clinical 
outcome prediction can benefit by integrating diverse 
graphs from genomic data or genomic knowledge, rather 
than relying only on single genomic data that may have 



CO 



Gene expression (Gr) 



Inter-relationship 



miRNA 




Gene expression (G D50 ) 

(D) 

/ / + / ® r, ® 7 

/ (?) - x 0> / / G) © / / ay & 



G D50 G R G A 

Figure 1 Example model of the original, damaged, reconstructed, and augmented graphs. (A) G 0 : Original graph from gene expression 
(B) G D50 : Gene expression graph with 50 percent of damaged edges (C) G R : Reconstructed graph via inter-relationship between miRNA and 
gene expression. Red lines represent edges from inter-relation and dashed lines shows the edges from the original graph. (D) G A : Augmented 
graph by combining 50 percent of damaged graph and reconstructed graph 
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possible limitations, i.e. incomplete information and 
noise. Technically, the data-setup of our experiment for 
the binary classification can be rephrased as {x n ,y n }^ =1 
where x„ e R d (d is the number of features and N is the 
number of patients) and y„ € {—1, 1} . 

Graph-based semi-supervised learning In the graph- 
based SSL, a patient x t (i = 1, n) is represented as a 
node i in a graph, and the relationship between patients 
is represented by an edge. The edge strength from each 
node / to each other node i is encoded in element w t j of 
a n x n symmetric weight matrix W. A Gaussian func- 
tion of Euclidean distance between patients was used to 
state connection strength: 



Wii 



exp 



if i- 



(1) 



otherwise. 



Nodes i, j are connected by an edge if i is in j's k- 
nearest-neighborhood or vice versa. The labeled nodes 
have labels yi e {-1, 1}, whereas the unlabeled nodes 
have zeros y u = 0. An output of graph-based SSL is an 
w-dimensional real-valued vector / = [fi r f u T ] T = (/, ...,//, 
fl+v ■■■> fn = i+u) T > which can be thresholded to create 
label predictions on / = f lt ...,/„ after learning. Graph- 
based SSL consists of two main conditions, which are 
loss condition and smoothness condition. It is assumed 
that / should be close to the given label y t in labeled 
nodes as a loss condition, and overall,/ should not be 
too different from the/ of adjacent nodes as a smooth- 
ness condition. One can obtain /by minimizing the fol- 
lowing quadratic functional [31,33,34]: 



min (f - 

f 



Y) T (f-Y) + rf T Lf 



(2) 



where y=(yi, yi, 0, ... 0) , and the matrix L, called 
the graph Laplacian matrix [39], is defined as L = D - W 
where D = diag(<i,), d t - ZjWy. The parameter p trades off 
loss versus smoothness. The solution of this problem is 
obtained as 



f={I + l iLT l y 
where / is the identity matrix. 



(3) 



Prediction based on inter-relationship from miRNA to 
mRNA 

The main problem of this study is to develop an ade- 
quate measure to calculate the similarity matrix contain- 
ing inter-relationship information between miRNA and 
gene expression. There are many measures to construct 
the similarity matrix for graph-based semi-supervised 
learning such as /r-NN graphs, e-NN graphs, tanh- 



weighted graphs, exp-weighted graphs, etc [32]. For 
these methods, there is an assumption that the length of 
vector from two matrices or matrix itself should be 
same in order to calculate the similarity. However, it is 
difficult to calculate the similarity matrix containing 
inter-relationship information between miRNA and tar- 
get genes because the length of vector from two 
matrices is different, for example 534 miRNAs and 
12,043 genes in miRNA and gene expression, respec- 
tively (Figure 2 (A)). Thus, a new measure for calculat- 
ing the similarity matrix containing inter-relationship 
information from different levels of genomic data has 
been developed in this study (Figure 2 (B)). 

MicroRNA dataset is represented by i patients (i = 1, 
N) and / miRNAs (/ = 1, N ml ) and gene expression data- 
set is represented by /' patients (j = 1, N) and m genes 
{m = 1, Af G ) (Figure 2 (A)). The edge strength from 
each miRNA patient to each gene expression patient is 
encoded in element w t j of an NxN weight matrix. A weight 
matrix containing inter-relationship information between 
miRNA and target genes is obtained by 



fy = ^2 ^2 miRNA(i, I) • gene(j, m) 

1=1 m=l 



(4) 



where w-th gene is targeted by l-th miRNA. After 
calculating/;, each element is normalized and trans- 
formed by 



Z til 
" stdtf) 



Wa = 



l+e~ 



(5) 



(6) 



Integration of multiple graphs In order to combine 
the graph from gene expression and the reconstructed 
graph via inter-relationship, two graphs can be inte- 
grated from finding optimum combination coefficients. 
Information from each graph is regarded as partially 
independent from and partly complementary to others. 
Reliability might be improved by integrating all available 
heterogeneous data using the method proposed by 
Tsuda et al. (2005), which has been re-validated on the 
extended problem of protein function classification [17] 
and clinical outcome prediction using multi-levels of 
genomic data [14]. Based on the method, the integration 
of multiple graphs was conducted through finding an 
optimum coefficient of the linear combination for the 
individual graphs. This corresponds to finding the com- 
bination coefficients a for the individual Laplacians of 
the following mathematical formulation: 
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Figure 2 Graphical data description. (A) Data structure of miRNA, gene expression and their target relation (B) Similarity matrix containing 
inter-relation between miRNA and gene expression 



if 

rnin y T (J + ^ a k L k ) ~ 1 y, ^ a k < /x (7) 

fe=i k 

where K is the number of graphs and L k is the corre- 
sponding graph-Laplacian of graph G k . Similar to the 
output prediction for single graphs, the solution is 
obtained by 

if 

/ = (/ + £> fe L fe )-y (8) 

Experimental setting 

In order to evaluate the effect of inter-relation betwee n 
miRNA and target genes, the intra-relation of gene 
expression was reconstructed from inter-relation 
between miRNA and gene expression. We defined the 4 
cases of graph for demonstrating the validity of the pro- 
posed method (Figure 1). 

(A) Original graph from gene expression (G Q ): We 
made an original graph from gene expression data 
where nodes depict patients and edges represent 
their possible relations. 

(B) Damaged graph from the original graph (G D ): 
We randomly reduced the edges from the original 
graph, Go, in order to make the incomplete graph. 
Gd50 means the gene expression graph with 50 per- 
cent of damaged edges. 



(C) Reconstructed graph via inter-relationship (G R ): 
Reconstructed graph of gene expression was gener- 
ated via inter-relationship between miRNA and gene 
expression. 

(D) Augmented graph (G A ): An augmented graph 
was generated by combining damaged graph (G D ) 
from the original graph and reconstructed graph 
(G R ) from inter-relation. 

Since genomic data sources are generally high dimen- 
sional and noisy, and contain many redundant features, 
which may incur computational difficulty and low accu- 
racy, a Student i-test based feature selection method 
was used [40]. Even though there are many feature 
selection techniques such as filter, wrapper, and 
embedded method [41], a simple univariate feature 
selection method was used in order to emphasize not 
the effect of feature selection but the effect of integra- 
tion with inter-relationship between miRNAs and target 
mRNAs. 

Results 

The receiver operating characteristic (ROC) curve plots 
sensitivity (true positive rate) as a function of 1-specifi- 
city (false positive rate) for a binary classifier system as 
its discrimination threshold is varied [42]. For each pro- 
blem, we calculated area under the curve (AUC) of 
ROC as a performance measure. Each experiment is 
repeated three times in order to estimate the variance of 
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Percent of Damaged Edges 

Figure 3 Performance comparison of 4 cases of graphs. G Q : Original graph from gene expression (brown-dashed line), G D : Gene expression 
graph with damages (blue line), G R : Reconstructed graph via inter-relationship between miRNA and mRNA (dark green-dashed line), G A : 
Augmented graph by damaged graph and reconstructed graph (red line) 



the measurement values and five-fold cross-validation 
was conducted in order to overcome over-fitting. The 
Wilcoxon signed-rank test was used to assess the signifi- 
cance level of difference in performance between the 
results of damaged graphs and augmented graphs [43]. 

Experimental results 

Figure 3 shows the prediction performance on the clas- 
sification of short-term and long-term survival for 4 
cases of proposed graphs. The AUCs of the 4 graphs 
(original graph from gene expression data (G 0 )> 
damaged graph from the original one (G D ), recon- 
structed graph via inter-relation between miRNA and 
mRNA (G R ), and augmented graph by damaged graph 
and reconstructed graph (G A )) are shown in the y axis 
and the percent of damaged edges are represented in 
the x axis. The main result of our study is that the pre- 
diction performance was improved by integrating the 
original gene expression (G Q ) and the reconstructed 
graph via inter-relation between miRNA and mRNA 
(G R ) (Figure 3). We found that the opportunity for suc- 
cess in prediction of clinical outcomes in GBM was 
increased when the prediction was based on the integra- 
tion of genomic data and genomic knowledge based on 
inter-relationship. 

As the percent of damaged edges in gene expression 
graph increased, the AUCs of damaged graph (G D ) are 
getting decreased sharply compared to the original 
graph from gene expression data (Go) (Figure 3). 



However, the performances of the augmented graph 
(Ga) showed robust results even though 90 percent of 
edges were reduced from the original graph. The perfor- 
mance of G A , a graph combining biological experimental 
data and genomic knowledge, is higher than the one of 
Go, an original graph from gene expression only, from 0 
to 30 percent of damaged edges (Figure 3). This sug- 
gests that genomic knowledge is complementary to the 
prediction power of explaining cancer phenotype even 
though biological experimental data such as gene 
expression has incomplete information. 

The significance level of difference in performance 
between the results of damaged graph and augmented 
graph was conducted using Wilcoxon signed-rank test 
(Table 1). The level of significance increased as long as 
the percentage of damaged edges increased. Figure 4 
shows a gradual increase in AUC by augmented graph. 
Dark blue bar represents the results from damaged 
graph and brown bar depicts the one from augmented 

Table 1 Significance test of the performances between 



Gq and Ga 



Percent of damaged edges 


AUC of G D 


AUC of G A 


P-value 


10% 


0.852 


0.864 


1 .80e-03 


30% 


0.844 


0.860 


2.10e-03 


50% 


0.816 


0.836 


1.91e-04 


70% 


0.760 


0.816 


2.38e-04 


90% 


0.648 


0.784 


2.36e-05 
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Figure 4 Improving performance from the augmented knowledge based on inter-relation between mRNA and miRNA. Dark blue bars 
represent the results from damaged graph and brown bar represents the one from augmented graph. Light blue bars indicate the AUCs of the 
original graph and reconstructed graph, respectively 



graph. Light blue bars indicate the AUC of the original 
graph and reconstructed graph, respectively. This pro- 
vides improving performance from the augmented 
knowledge based on inter-relation between mRNA and 
miRNA. 

Biological implication 

Through the proposed model, the molecular signatures 
of miRNA and target genes, most associated with sur- 
vival, were selected. First, miRNAs and gene features 
were separately selected from the prediction model 
based on intra-relation using independent data set, 
miRNA expression and gene expression, respectively. 
Then, miRNA and target gene pairs were selected 
from the prediction model based on inter-relation 
between miRNA and gene expression data. Figure 5 
represents a heatmap of fold changes of selected miR- 
NAs and genes, which are also belonging to selected 
miRNA-target gene pairs. The first column of Figure 5 
shows the fold changes of gene expression from 
selected 11 genes and remaining columns represent 
the fold changes of miRNA expression from selected 
19 miRNAs. Blue cell in the figure indicates that gene 
expression or miRNA expression in the short-term 
survival group is under-expressed compared to the 
long-term survival group. Light blue cell in the heat- 
map represents non-target relation between miRNA 
and gene. Many of these miRNA and target gene pairs 
affect critical biological processes that are frequently 
dysregulated in cancer. 



For instance, three miRNAs, hsa-mir-20a, hsa-mir- 
106a, and hsa-mir-221, were also identified as miRNA 
signatures that predicts survival in Glioblastoma [44]. 
Hsa-mir-20a and hsa-mir-106a miRNAs were classified 
into the protective class and hsa-mir-221 was classified 
into the risk class in the previous study as well [44]. 
The protective miRNAs were expressed at a higher level 
in the long-term survival group compared to the short- 
term survival group while the risky miRNAs were 
expressed at a higher level in the short-term group than 
in the long-term group. The risky and protective class 
of these miRNAs supports the fact that their functions 
being either promoting or inhibitory, respectively. 
Under-expression of hsa-mir-106a has been shown to be 
associated with poor patient survival in colon cancer 
and glioma [45,46]. Target genes of hsa-mir-106a, 
BDH1, UPP1, TUSC2, and KMO, were over-expressed 
in the short-term survival group, which is a reverse pat- 
tern of expression in hsa-mir-106a. These genes play 
important roles that affect metabolic process, cell cycle, 
or nucleotide catabolic process in several cancers 
[47-50]. The miRNA cluster, which contains hsa-mir- 
20a, was found to promote lung cancer growth in vitro, 
activated by c-myc and promote tumor angiogenesis 
[51]. HFE, one of the selected target genes of hsa-mir- 
20a, has been found to be associated with immune 
response in GBM and ovarian cancer [50,52]. Among 
selected miRNA and target gene pairs, other pairs were 
of interest because they could suggest some novel indir- 
ect mechanisms in GBM tumorigenesis. 
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Figure 5 Heatmap of selected miRNA and target gene pairs. The first column shows the fold changes of gene expression from selected 1 ' 
genes and remaining columns represent the fold changes of miRNA expression from selected 19 miRNAs. Blue cells indicate that gene 
expression or miRNA expression in the short-term survival group is under-expressed compared to the long-term survival group. Light blue cells 
represent non-target relation between miRNA and gene. 



Table 2 Description of the selected gene features between short-term and long-term survival group in GBM 

AUC diff 



Gene 


Region 


Function 


Up/ 
down 


RAGE 


14q32.31 


Renal tumor antigen/threonine kinase activity/transferase activity 


Up 


ATAD3A 


1 p36.33 


ATP binding/nucleotide binding 


Down 


HEMK1 


3p21 .31 


DNA binding/N-methyltransferase activity 


Up 


KMO 


1q43 


Integral to membrane/kynurenine 3-monooxygenase activity 


Up 


RBKS 


2p23.2 


D-ribose metabolic process/ribokinase activity 


Up 


CCNJL 


5q33.3 


Nucleus/regulation of progression through cell cycle 


Down 


LGALS8 


1q43 


Extracellular space/sugar binding 


Up 


UPP1 


7p12.3 


Cytoplasm/nucleoside metabolic process/nucleotide catabolic process 


Up 


BDH1 


3q29 


3-hydroxybutyrate dehydrogenase activity/metabolic process/mitochondrial inner membrane/ 
mitochondrial matrix 


Up 


HFE 


6p22.1 


Antigen processing and presentation/ immune response/ protein complex assembly 


Up 


TUSC2 


3p21 .31 


Cell cycle/cell proliferation/cell-cell signalling/negative regulation of progression through cell cycle 


Up 



0.028 
0.024 
0.012 
0.012 
0.012 
0.008 
0.008 
0.008 
0.004 

0.004 
0.000 



The gene lists in the first column were sorted by the AUC_diff, which calculated the difference between the original AUC with 1 1 gene features and the AUC 
without one gene among 1 1 gene features. 
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Table 2 describes the selected gene features between 
short-term and long-term survival group. These gene 
lists were sorted by the AUC_diff, which calculated the 
difference between the original AUC with 11 gene fea- 
tures and the AUC without one gene among 11 gene 
features. The high value of AUC_diff means that the 
contribution of the gene feature, being excluded for cal- 
culating the AUC_diff, to the prediction model is high. 
RAGE showed the highest AUC_diff, 0.028, and AUC_- 
diff of ATAD3A, 0.024, was secondly high among gene 
features (Table 2). 

The RAGE pathway may play an important role in 
STAT3 induction in glioma-associated microglia and 
macrophages, a process that might be mediated through 
S100B [53]. In addition, the under-expression of 
ATAD3A may be involved in the chemosensitivity of oli- 
godendrogliomas and the transformation pathway [54]. 

Comparison with other proposed methods for inter- 
relationship matrix 

Despite the difficulty of developing an adequate measure 
to calculate the similarity matrix containing inter-rela- 
tionship information between miRNA and gene expres- 
sion, we implemented 4 measures, Gr_1, Gr_2, Gr_3, 
and G R _4, and compared with the proposed method, 
G R _5, in order to assess the benefit of the proposed one. 
Gr_1 was calculated by multiplication of correlation 
matrices from gene expression and miRNA expression. 
The method of Gr_2 was generated through the simple 
addition of two vectors, genes and miRNAs, for contain- 
ing inter-relationship. On the other hand, the method of 
Gr_3 was calculated by removing miRNAs and genes, 
which were not belonging to the target relations, after 
simple addition of two vectors, genes and miRNAs. 
Gr_4 was focused on a targeted gene and considered 
multiple miRNAs targeting the specific gene when cal- 
culating the inter-relationship. In contrast to G R _4, 
G R _5, the proposed method in our study, was focused 
on a miRNA and considered multiple target genes from 
the specific miRNA. 

Even though the performance of G R _2 itself showed 
the best (AUC = 0.828), the performance of G A (AUC = 
0.868), integrating G D (AUC = 0.856) and G R _5 (AUC = 
0.796), showed the best in our comparison scheme 
(Figure 6). This suggests that the method of Gr_5 has 
more partly complementary to the gene expression itself 
than the others so that it improves the prediction power 
when integrating with gene expression. 

Conclusions 

In the present study, the intra-relation of gene expres- 
sion was reconstructed from inter-relation between 
miRNA and gene expression for prediction of short- 
term/long-term survival of GBM patients in order to 



>-G R 




Gr_2 G r _3 G r _4 G r _5 

Figure 6 Comparison of other proposed methods. Four 
measures, Gr_1, Gr_2, G r _3, and G R _4, were implemented and used 
for calculating G A in order to assess the effect of the proposed 
method, G R _5. The blue dotted line shows the AUC of original gene 
expression (G 0 ) and the orange dotted line represents the AUC of 
miRNA data alone. 



provide a preliminary insight on the question that is 
how informative inter-relationship between miRNA and 
gene expression is when different levels of genomic 
dataset and valid genomic knowledge are available. 
Based on our results, the accuracy of our predictive 
model increases because of incorporation of information 
fused over genomic dataset from gene expression and 
genomic knowledge from inter-relation between miRNA 
and gene expression. New evidence suggests that geno- 
mic knowledge is complementary to the prediction 
power of explaining cancer phenotype even though bio- 
logical experimental data such as gene expression has 
incomplete information. In addition, our finding sug- 
gests that the utilization of external knowledge repre- 
senting miRNA-mediated regulation of gene expression 
is substantially useful for elucidating the cancer pheno- 
type since miRNAs regulate many genes associated 
with different biological processes such as develop- 
ment, stress response, apoptosis, proliferation, and 
tumorigenesis. 

The present study underpins our on-going work. It is 
expected that the next attempt will be more focused on 
how to utilize the information from 'intra-relation', the 
relation between different levels: from the genome level 
to epigenome, transcriptome, proteome, and further 
stretched to the phenome level. There might be other 
possible intra-relations between different layers of 
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genomic data such as 'copy number alteration region - 
genes located in the alteration region,' 'DNA methyla- 
tion site - specific genes regulated by promoter regions,' 
etc. Thus, when integrating multi-levels of genomic 
data, it might be valuable that a framework will be cap- 
able of containing the inter-relationships between geno- 
mic features belonging to different layers of the 
biological system as genomic knowledge. Even though 
this study is limited to the prediction of short-term/ 
long-term survival in GBM as a base task, the proposed 
framework can be applied to other cancer types or other 
clinical outcomes such as grade, stage, metastasis, etc. In 
addition, we could apply the proposed method to 
another layer of 'intra-relation' based on miRNA expres- 
sion profiles together with 'intra-relation' between 
mRNAs. 

Recently, TCGA has been generating the additional 
cancer genomic data for about 20 to 25 tumor types as 
the second phase of the project. With abundance in dif- 
ferent types of genomic, clinical data and valid genomic 
knowledge, our proposed framework will be valuable for 
explaining the underlying tumorigenesis, eventually lead- 
ing to more effective screening strategies and therapeu- 
tic targets in many types of cancer. 
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